{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variational Inference for Gaussian Process Classification Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### By Wu Lin - <a href=\"mailto:yorker.lin@gmail.com\">yorker.lin@gmail.com</a> - <a href=\"https://github.com/yorkerlin\">https://github.com/yorkerlin</a> Suggested by Heiko Strathmann and Mohammad Emtiyaz Khan "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Based on the notebook of Gaussian Processes by Heiko Strathmann - <a href=\"mailto:heiko.strathmann@gmail.com\">heiko.strathmann@gmail.com</a> - <a href=\"https://github.com/karlnapf\">github.com/karlnapf</a> - <a href=\"http://herrstrathmann.de\">herrstrathmann.de</a>, and the GP framework of the Google summer of code 2014 project of Wu Lin,  - Google summer of code 2013 project of Roman Votyakov - <a href=\"mailto:votjakovr@gmail.com\">votjakovr@gmail.com</a> - <a href=\"https://github.com/votjakovr\">github.com/votjakovr</a>, and the Google summer of code 2012 project of Jacob Walker - <a href=\"mailto:walke434@gmail.com\">walke434@gmail.com</a> - <a href=\"https://github.com/puffin444\">github.com/puffin444</a> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A <a href=\"http://en.wikipedia.org/wiki/Gaussian_process\">Gaussian process</a> (GP) model is a flexible model, which can be used to do <a href=\"http://en.wikipedia.org/wiki/Regression_analysis\">regression</a>, <a href=\"http://en.wikipedia.org/wiki/Statistical_classification\">classification</a>, <a href=\"http://en.wikipedia.org/wiki/Dimensionality_reduction\">dimension reduction</a>, and <a href=\"http://en.wikipedia.org/wiki/Deep_learning\">deep learning</a>. A GP model is a <a href=\"http://en.wikipedia.org/wiki/Category:Non-parametric_Bayesian_methods\">Bayesian non-parametric model</a> using <a href=\"http://en.wikipedia.org/wiki/Kernel_method\">kernel methods</a>.\n",
    "This notebook describes how to do <a href=\"http://en.wikipedia.org/wiki/Variational_Bayesian_methods\">variational inference</a> for Gaussian Process Classification (GPC) models in Shogun. In a GPC model, a set of GP functions are used to predict a discrete label given its features of a data point.\n",
    "\n",
    "For the theory about GPC, we assume that readers have some background in <a href=\"http://en.wikipedia.org/wiki/Bayesian_statistics\">Bayesian statistics</a> and basic knowledge of Gaussian Processes. For the background, please see the <a href=\"http://www.shogun-toolbox.org/static/notebook/current/gaussian_processes.html\">notebook</a> or <a href=\"http://nbviewer.ipython.org/github/AM207/2015/blob/master/Lectures/Lecture21_GaussianProcesses.ipynb\">another notebook</a> about Gaussian Processes. After providing a semi-formal introduction, we illustrate how to do training, make predictions, and automatically learn parameters of GPC models (model selection) in Shogun."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='toc'>Table of Content</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* <a href='#overview'>Overview</a> \n",
    "    * <a href=\"#benefit\">Benefits of Using GP Models in Shogun</a>\n",
    "    * <a href=\"#roadmap\">Roadmap of GP Models in Shogun</a>\n",
    "* <a href=\"#intro\">Brief Introduction</a>\n",
    "* Theory (Skip if you just want code examples)\n",
    " * <a href=\"#bigpic\">Big Picture of Variational Inference for GPC Models</a>\n",
    " * <a href=\"#detailbg\">More Detailed Background about GPC Models</a>\n",
    " * <a href=\"#VGI\">Variational Gaussian Inference for GPC Models</a>\n",
    " * <a href=\"#nonclosedForm\">Dealing with the Non-closed Form Issue in GPC Models</a>\n",
    " * <a href=\"#bgvb\">Some Background of Variational Bounds in GPC Models</a>\n",
    " * <a href=\"#difficulty\">Difficulty of Scaling up standard GPC Models</a> \n",
    " * <a href=\"#sparseGP\">Big Picture of Sparse GPC Models for Large Scale Inference</a>  \n",
    " * <a href=\"#ref\">References</a>\n",
    "* Summary (Shogun's API) \n",
    " * <a href='#likelihood'>Likelihood Classes for GPC Models in Shogun</a>\n",
    " * <a href=\"#inference\">Inference Methods for GPC Models in Shogun</a>\n",
    " * <a href='#opt'>Optimizers Used in Inference Methods (for advanced users)</a>\n",
    " * <a href='#linesearch'>Line Search Methods Used in LBFGS Optimizer (for advanced users)</a>\n",
    "* Code Examples\n",
    " * <a href='#toy'>A Toy Example of Variational Gaussian Inference</a>\n",
    " * <a href='#visu'>An Example for Visualization (banana dataset)</a>\n",
    "     * <a href=\"#largescale\">A Glimpse of Large Scale Inference Methods</a> \n",
    "     * <a href=\"#optInd\">Optimizing Inducing Points</a>       \n",
    " * <a href=\"#realeg1\">A Real-World Example (sonar dataset)</a>\n",
    " * <a href=\"#realeg2\">Another Real-world Example (USPS dataset)</a>\n",
    "     * <a href='#laplace'>Using Laplace Method</a>\n",
    "     * <a href='#covv'>Using Covariance Variational Method</a>\n",
    "     * <a href='#meanfield'>Using Mean-field Variational Method</a>\n",
    "     * <a href=\"#dual\">Using Dual Variational Method</a>\n",
    "     * <a href=\"#chol\">Using Cholesky Variational Method</a>\n",
    "     * <a href=\"#bounds\">Likelihood Classes Supported Variational Bounds</a> \n",
    "     * <a href='#parameters'>Optimizing Parameters of GPC (Model Selection)</a>\n",
    "* <a href='#todo'>Soon to Come</a>\n",
    " * More Large Scale inference methods for GPC Models\n",
    " * Latent GP Models for Dimension Reduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "# import all shogun classes\n",
    "import os\nSHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')\n",
    "from shogun import *\n",
    "\n",
    "# import all required libraries\n",
    "import scipy\n",
    "import scipy.io\n",
    "import numpy as np\n",
    "from math import exp,sqrt,log\n",
    "import random\n",
    "import time\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.cm as cm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='overview'>Overview</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"benefit\">Benefits of Using GP Models in Shogun</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Unified API across <font color='red'>10</font> major languages (Python, R, Octave, Java, and more)\n",
    "* Better performance (written in C++ and parallel computing) compared to <a href=\"http://scikit-learn.org/\">scikit-learn</a> and <a href=\"http://www.gaussianprocess.org/gpml/code/\">GPML</a> (TODO adding some speedup statistics)\n",
    "* More (<font color='red'>6</font> metnods) medium-scale variational inference methods compared to scikit-learn and GPML\n",
    "* More (<font color='red'>8</font> methods, including 5 ongoing) large-scale variational inference methods compared to scikit-learn and GPML\n",
    "* Efficient implementation of more than <font color='red'>25</font> kernels (covariance functions)\n",
    "* Supporting GP multi-classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"roadmap\">Roadmap of GP Models in Shogun</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Inference methods for Standard GP models\n",
    "    * Exact method for GP regression models\n",
    "    * Variational methods for GP classification models\n",
    "    \n",
    "* Inference methods for sparse GP models (also known as large scale inference methods)\n",
    "    * Fully Independent Training Conditional (FITC) methods for GP regression models\n",
    "    * FITC Laplace methods for GP classification models\n",
    "    * FITC KL methods for GP regression and classification models \n",
    "    * Parallel/Distributed Variational Inference for GP regression and classification models (Ongoing)\n",
    "    * Stochastic Variational Inference for GP regression and classification models (Ongoing)\n",
    "    * Nonparametric Variational Inference for GP regression and classification models (TODO)\n",
    "\n",
    "* Beyond GP models for regression and classification\n",
    "    * GP latent variable models for dimension reduction (Ongoing)\n",
    "    * Deep GP models for deep learning (TODO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='intro'>Brief Introduction</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we mainly focus on binary classification problems. It is not hard to exend the presented methodology to multi-classifcation problems.   Given a binary classification problem, binary labels and features are respectively defined as a column vector, $\\mathbf{y}\\in\\mathcal{Y}^n=\\{-1,+1\\}^n$, and a matrix, $\\mathbf{X}\\in\\mathcal{R}^{n\\times d}$, where $n$ is the number of data points and $d$ is the number of features.  A likelihood function $p(\\mathbf{y}|\\text{f},\\mathbf{X})=p(\\mathbf{y}|\\mathbf{f})$ is used to model data with labels, where a latent function $\\text{f}:\\mathcal{R}^{d}\\to \\mathcal{R} $ is drawn from a GP prior and $\\text{f}(\\mathbf{X})=\\mathbf{f} \\in \\mathcal{R}^{n}$ is a column vector by applying $\\text{f}$ to each row of $\\mathbf{X}$. Given data with labels, our goal is to train a GP classifier ($p(\\text{f}|\\mathbf{y},\\mathbf{X} $)) to fit the data well and predict new data points ($p(y_\\text{new}|\\mathbf{y},\\mathbf{X},\\mathbf{x}_\\text{new})$). The following sections cover detailed background of variational inference for GPC models.\n",
    "\n",
    "Note that the difference between $\\text{f}$ and $\\mathbf{f}$ is that $\\text{f}$ is a function while $\\mathbf{f}$ is a n-dimensional vector. The GP prior, $p(\\text{f})$, and $p(\\mathbf{f})=p(\\text{f}(\\mathbf{X}))$ are also different because $p(\\text{f})$ is a distribution in an infinite dimensional (function) space while $p(\\mathbf{f})$ is a distribution in a finite dimensional real space.\n",
    "\n",
    "\n",
    "To gain some intuition how these GPC models behave, and how well variational Gaussian inference methods approximate posteriors, please see section <a href='#toy'>A Toy Example of Variational Gaussian Inference</a> and section <a href='#visu'>An Example of for Visualization</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='bigpic'>Big Picture of Variational Inference for GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to learn posterior, $p(\\text{f}|\\mathbf{y},\\mathbf{X}) \\propto p(\\mathbf{y}| \\text{f},\\mathbf{X}) \\times p(\\text{f}|\\mathbf{X}) $, given training data points. Note that $p(\\text{f})$ is the prior over the GP functions. We will see in <a href='#detailbg'>next section</a> that all we need is to learn $p(\\mathbf{f}|\\mathbf{y})=p(\\text{f}(\\mathbf{X})|\\mathbf{y},\\mathbf{X})$, which is a distribution in an finite dimensional space. (Recall that we use $\\mathbf{f}$ as $\\text{f}(\\mathbf{X})$ and $p(\\mathbf{f}|\\mathbf{y})$ as $p(\\text{f}(\\mathbf{X})|\\mathbf{y},\\mathbf{X})$ for short).\n",
    "\n",
    "The key intuition of variational inference is to approximate the distribution of interest (here $p(\\mathbf{f}|\\mathbf{y})$, which is a non-Gaussian distribution) with a more tractable distribution (for variational Gaussian inference, a Gaussian $q(\\mathbf{f}|\\mathbf{y}))$, via minimizing the <a href=\"http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence\">Kullback–Leibler divergence</a> (KL divergence) \n",
    "\n",
    "$${\\mathrm{KL}}(Q\\|P) = \\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{p(\\mathbf{f}|\\mathbf{y})}\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f}$$\n",
    "\n",
    "between the true distribution and the approximation.\n",
    "\n",
    "Please see <a id='detailbg'>next section</a> if you want to know the reason why making prediction of GPC is not easy when $p(\\mathbf{f}|\\mathbf{y})$ is non-Gaussian. Therefore, we want to use a more tractable distribution $q(\\mathbf{f}|\\mathbf{y}))$ to perform prediction\n",
    "\n",
    "Throughout this notebook, we deal with binary classification problems using <a href=\"http://en.wikipedia.org/wiki/Logit\">inverse-logit</a>, (also known as Bernoulli-logistic function)\n",
    "likelihood to model labels, given by \n",
    "\n",
    "$p(\\mathbf{y}|\\mathbf{f})=\\prod_{i=1}^n p(y_\\text{i}|\\text{f}(\\mathbf{X}_\\text{i}))=\\prod_{i=1}^n \\frac{1}{1+\\exp(-y_\\text{i} \\mathbf{f}_\\text{i})}$,\n",
    "where $\\mathbf{f}_\\text{i}$ is the i-th row of $\\mathbf{X}$ and $\\text{f}(\\mathbf{X}_\\text{i})=\\mathbf{f}_\\text{i}$\n",
    "\n",
    "Note that In GPC models we implicitly assume that $\\mathbf{y}$ are **conditional independent** if (latent) $\\mathbf{f}$ are known.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='detailbg'>More Detailed Background about GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In GPC models, our goal is to predict the label, $y_\\text{new}$, of a new data point, a row vector, $\\mathbf{x}_\\text{new}\\in\\mathcal{R}^{d}$ based on $p(y_\\text{new}|\\mathbf{y},\\mathbf{X},\\mathbf{x}_\\text{new})$. \n",
    "According to <a href=\"http://en.wikipedia.org/wiki/Bayes%27_theorem\">Bayes' theorem</a>, we know that <a href=\"http://en.wikipedia.org/wiki/Posterior_predictive_distribution\"> the posterior predictive distribution</a> is    $$p(y_\\text{new}|\\mathbf{y},\\mathbf{X},\\mathbf{x}_\\text{new})= \\int {p(y_\\text{new}|\\text{f},\\mathbf{x}_\\text{new})p(\\text{f}|\\mathbf{y},\\mathbf{X}) d\\text{f}}$$ \n",
    "Informally, according to <a href=\"http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Marginal_distributions\">the property</a> of GP about marginalization, we have:\n",
    "\n",
    "$$\\int {p(\\mathbf{y}_\\text{new}|\\text{f},\\mathbf{x}_\\text{new})p(\\text{f}|\\mathbf{y},\\mathbf{X}) d\\text{f}}= \\int {p(\\mathbf{y}_\\text{new}|\\text{f}(\\mathbf{x}_\\text{new}),\\mathbf{x}_\\text{new})p(\\text{f}(\\mathbf{x}_\\text{new})|\\mathbf{y},\\mathbf{X}) d\\text{f}(\\mathbf{x}_\\text{new})} = \\int {p(\\mathbf{y}_\\text{new}|\\mathbf{f}_\\text{new})p(\\mathbf{f}_\\text{new}|\\mathbf{y},\\mathbf{X}) d\\mathbf{f}_\\text{new}} $$.\n",
    "\n",
    "where $\\text{f}(\\mathbf{x}_\\text{new})=\\mathbf{f}_\\text{new}$, $p(\\mathbf{y}_\\text{new}|\\mathbf{f}_\\text{new})=p(\\mathbf{y}_\\text{new}|\\text{f},\\mathbf{x}_\\text{new})$ .\n",
    "\n",
    "The key difference here is that $p(\\text{f}|\\mathbf{y}, \\mathbf{X})$ is a distribution in an infinite dimensional space while $p(\\mathbf{f}_{new}|\\mathbf{y},\\mathbf{X})$ is a distribution in a one-dimensional space.\n",
    "\n",
    "Note that $p(\\mathbf{f}_\\text{new}|\\mathbf{y},\\mathbf{X})=\\int {p(\\mathbf{f}_\\text{new}|\\text{f}) p(\\text{f}|\\mathbf{y},\\mathbf{X}) d\\text{f}}$.\n",
    "\n",
    "Similarly, $$\\int {p(\\mathbf{f}_\\text{new}|\\text{f}) p(\\text{f}|\\mathbf{y},\\mathbf{X}) d\\text{f}}=\\int {p(\\mathbf{f}_\\text{new}|\\text{f}(\\mathbf{X})) p(\\text{f}(\\mathbf{X})|\\mathbf{y},\\mathbf{X}) d\\text{f}(\\mathbf{X}) } = \\int {p(\\mathbf{f}_\\text{new}|\\mathbf{f}) p(\\mathbf{f}|\\mathbf{y}) d\\mathbf{f}} $$.\n",
    "\n",
    "Recall that $p(\\mathbf{f}|\\mathbf{y})=p(\\text{f}(\\mathbf{X})|\\mathbf{y}, \\mathbf{X})$ is a distribution in a n-dimensional space while $p(\\text{f}|\\mathbf{y},\\mathbf{X})$ is a distribution in an infinite dimensional space.\n",
    "\n",
    "Informally, accodring to GP, the following holds:\n",
    "\n",
    "\n",
    "$p(\\mathbf{f}_\\text{new}|\\mathbf{f})=\\frac{p(\\mathbf{f}_\\text{new},\\mathbf{f})}{p(\\mathbf{f})}= \\frac{p(\\text{f}(\\mathbf{ \\left[ \\begin{array}{c} \\mathbf{X} \\\\ \\hdashline \\mathbf{x}_\\text{new} \\end{array} \\right] }))}{p(\\text{f}(\\mathbf{\\mathbf{X}}))}$,  where $\\mathbf{ \\left[ \\begin{array}{c} \\mathbf{X} \\\\ \\hdashline \\mathbf{x}_\\text{new} \\end{array} \\right] } \\in \\mathcal{R}^{(n+1)\\times d}$ and $\\text{f}(\\mathbf{ \\left[ \\begin{array}{c} \\mathbf{X} \\\\ \\hdashline \\mathbf{x}_\\text{new} \\end{array} \\right] }) \\in \\mathcal{R}^{n+1} $\n",
    "\n",
    "Note that If $p(\\mathbf{f}|\\mathbf{y})$ is not a Gaussian distribution, $p(\\mathbf{f}_{new}|\\mathbf{y},\\mathbf{X})$ usually has not a closed form. Unfortunately, in classification problems, $p(\\mathbf{f}|\\mathbf{y})$ usually is a non-Gaussian distribution since $p(\\mathbf{f}|\\mathbf{y}) \\propto p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})$, where $p(\\mathbf{f})$ is a Gaussian distribution but $p(\\mathbf{y}|\\mathbf{f}))$ (the likelihood) is a non-Gaussian distribution to model discrete labels.\n",
    "\n",
    "Variational **Gaussian** inference in GPC is to approximate $p(\\mathbf{f}|\\mathbf{y})$ using **a Gaussian distribution**, $q(\\mathbf{f}|\\mathbf{y})$, via minimizing the KL divergence, ${\\mathrm{KL}}(Q\\|P)$. Readers may note that the KL divergence is asymmetric. If we minimize ${\\mathrm{KL}}(P\\|Q)$ instead of ${\\mathrm{KL}}(Q\\|P)$, the inference method is called <a href=\"http://en.wikipedia.org/wiki/Expectation_propagation\">expectation propagation</a> (EP). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='VGI'>Variational Gaussian Inference for GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned in section <a href='#bigpic'>Big Picture of Variational Inference for GPC Models</a>, variaitonal inference in GPC models is about **minimizing** the KL divergence given as below:\n",
    "\n",
    "${\\mathrm{KL}}(Q\\|P) = \\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{p(\\mathbf{f}|y)}\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f} = \\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{  \\frac{p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})}{p(\\mathbf{y})}  }\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f}=\\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\mathbf{y})  }\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f} - \\text{const}$. \n",
    "\n",
    "\n",
    "Another way to explain variational inference in GPC is to **maximize** a lower (variational) bound of the log of marginal likelihood, $\\ln (p(\\mathbf{y}|\\mathbf{X})) $.\n",
    "\n",
    "$\\ln (p(\\mathbf{y}|\\mathbf{X})) = \\ln (\\int_{-\\infty}^\\infty {p(\\mathbf{y}|\\text{f},\\mathbf{X})p(\\text{f}|\\mathbf{X})} d\\text{f}) =  \\ln (\\int_{-\\infty}^\\infty {p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})} d\\mathbf{f})= \\ln (\\int_{-\\infty}^\\infty { q(\\mathbf{f}) \\frac{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})} {q(\\mathbf{f}|\\mathbf{y})}} d\\mathbf{f}) \\geq  (\\int_{-\\infty}^\\infty { q(\\mathbf{f}|\\mathbf{y}) \\ln ( \\frac{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})} {q(\\mathbf{f}|\\mathbf{y})}} ) d\\mathbf{f})$\n",
    "\n",
    "\n",
    "where the inequality is based on <a href=\"http://en.wikipedia.org/wiki/Jensen%27s_inequality\">Jensen’s inequality</a> and $\\int_{-\\infty}^\\infty { q(\\mathbf{f}|\\mathbf{y}) \\ln ( \\frac{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})} {q(\\mathbf{f}|\\mathbf{y})}} ) d\\mathbf{f}$ is a lower (variational) bound .\n",
    "\n",
    "$\\int_{-\\infty}^\\infty { q(\\mathbf{f}|\\mathbf{y}) \\ln ( \\frac{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})} {q(\\mathbf{f}|\\mathbf{y})}} ) d\\mathbf{f}=- \\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{ p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})  }\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f} = -E_q[\\ln(q(\\mathbf{f}|\\mathbf{y}))] + E_q[\\ln(p(\\mathbf{f}))] + E_q[\\ln(p(\\mathbf{y}|\\mathbf{f}))]$, \n",
    "\n",
    "where $E_q(\\cdot)$ denotes the expectation with respect to $q(\\mathbf{f}|\\mathbf{y})$.\n",
    "\n",
    "Note that in **general** variaitonal inference, $q(\\mathbf{f}|\\mathbf{y})$ can be a free-form distribution learnt by minimizing the KL divergence. \n",
    "In variational **Gaussian** inference, $q(\\mathbf{f}|\\mathbf{y})$ is a Gaussian distribution and distribution parameters are estimated by minimizing the KL divergence\n",
    "\n",
    "In variaitonal Gaussian inference in GPC, the last term, $E_q[\\ln(p(\\mathbf{y}|\\mathbf{f}))]$, usually does **not** have a closed form. Recall that $p(\\mathbf{y}|\\mathbf{f})$ usually is a non-Gaussian distrbution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='nonclosedForm'>Dealing with the No-closed Form Issue in GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In variational Gaussian inference, we can show that $E_q[\\ln(p(\\mathbf{y}|\\mathbf{f}))]$ is summation of one-dimensional integrations via the similar <a href=\"http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Marginal_distributions\">marginalization property</a> of multivariate Gaussian distribution. Recall that labels ($\\mathbf{y}$) are conditional independent given $\\mathbf{f}$.  We can obtain the following equations:\n",
    "\n",
    "\\begin{equation}\n",
    "\\begin{array}{ll}\n",
    "E_q[\\ln(p(\\mathbf{y}|\\mathbf{f}))] &\\\\\n",
    "=E_q[\\sum_{i=1}^n {\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})}] & \\text{(conditional independence)} \\\\\n",
    "=\\sum_{i=1}^n {E_q[\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})]} & \\text{(linearity of expectation)} \\\\\n",
    "=\\sum_{i=1}^n {E_{q_\\text{i}}[\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})]} & \\text{(marginalization property)} \\\\\n",
    "\\end{array}\n",
    "\\end{equation}\n",
    "\n",
    "where $q$ denotes $q(\\mathbf{f}|\\mathbf{y})$ is a multivariable $N(\\mu, \\Sigma)$, $q_i$ denotes $q_i(\\mathbf{f}_\\text{i}|\\mathbf{y}_\\text{i})$ is a univariate $N(\\mu_\\text{i}, \\Sigma_\\text{i,i})$, and $\\mu_\\text{i}$ and $\\Sigma_\\text{i,i}$ are the i-th element of the mean $\\mu$ and the i-th diagonal element of the covariance matrix $\\Sigma$ of $q(\\mathbf{f}|\\mathbf{y})$ respectively.\n",
    "\n",
    "\n",
    "Of course, $E_{q_\\text{i}}[\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})]$ usually does not have a closed form in GPC. However, it is relatively easy to deal with this one-dimensional problem.  \n",
    "One way to deal with this issue is numerical integration or monte carlo methods.\n",
    "Another way is to use some variational lower bound of $E_{q_\\text{i}}[\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})]$. For now, we assume this expectation, $E_{q_\\text{i}}[\\ln(p(\\mathbf{y}_\\text{i}|\\mathbf{f}_\\text{i})]$, is given and we will discuss later in section <a href=\"#bgvb\">Some Background of Variational Bounds in GPC Models</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='likelihood'>Likelihood Classes for GPC Models in Shogun</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr>\n",
    "<th>Approach to deal with non-closed Form</th>\n",
    "<th>Implemenation in Shogun (Likelihood)</th>\n",
    "<th>Supported Inference Method</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Gauss%E2%80%93Hermite_quadrature\">Gaussian Numerical Integration for One-dimensional Space</a></td>\n",
    "<td> <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitVGLikelihood.html\">LogitVGLikelihood (Bernoulli-logistic function)</a> </td>\n",
    "\n",
    "\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a>\n",
    "</td>\n",
    "\n",
    "\n",
    "\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://www.icml-2011.org/papers/376_icmlpaper.pdf\">The Piecewise Variational Bound for Bernoulli-logistic function</a></td>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitVGPiecewiseBoundLikelihood.html\">LogitVGPiecewiseBoundLikelihood</a></td>\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a>\n",
    "</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"http://www.cs.cmu.edu/~lafferty/pub/ctm.pdf\">The Bound for Bernoulli-logistic function from Blei's work</a></td>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitDVGLikelihood.html\">LogitDVGLikelihood</a> The bound is implemented in the dual form</td> \n",
    "\n",
    "\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method</a>\n",
    "</td>\n",
    "</tr>\n",
    "\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"https://circle.ubc.ca/handle/2429/43640\">The Jaakkola Bound for Bernoulli-logistic function</a></td>\n",
    "<td>NA</td>\n",
    "<td>NA</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"https://circle.ubc.ca/handle/2429/43640\">The Bohning Bound for Bernoulli-logistic function</a></td>\n",
    "<td>NA</td>\n",
    "<td>NA</td>\n",
    "</tr>\n",
    "\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"http://arxiv.org/abs/1206.6430\">Monte Carlo Method</a></td>\n",
    "<td>NA</td>\n",
    "<td>NA</td>\n",
    "</tr>\n",
    "\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='inference'>Inference Methods for GPC Models in Shogun</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr>\n",
    "<th>Inference Method in GPC</th>\n",
    "<th>Idea of Approximation</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a></td>\n",
    "<td>Using the second-order Taylor expansion in the mode of the true posterior</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a></td>\n",
    "<td>Minimizing KL(approximated distribution || true posterior), \n",
    "where the approximated Gaussian distribution has a complete covariance matrix\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a></td>\n",
    "<td>Minimizing KL(approximated distribution || true posterior), \n",
    "where the approximated Gaussian distribution has a diagonal covariance matrix</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a></td>\n",
    "<td>Minimizing KL(approximated distribution || true posterior), \n",
    "where the approximated Gaussian distribution has a complete covariance matrix in \n",
    "Cholesky representation</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method</a></td>\n",
    "<td>Minimizing KL(approximated distribution || true posterior) using variational Gaussian inference in a dual form\n",
    "</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CEPInferenceMethod.html\">Expectation Propagation Method</a></td>\n",
    "<td>Minimizing KL(true posterior || approximated distribution), where the approximated distribution is a Gaussian distribution\n",
    "</td>\n",
    "</tr>\n",
    "\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='opt'>Optimizers Used in Inference Methods(for advanced users)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr>\n",
    "<th>Optimizer</th>\n",
    "<th>Inference Method</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Limited-memory_BFGS\">Limited-memory BFGS (LBFGS)</a></td>\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method (default optimizer)</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method (default optimizer)</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method (default optimizer)</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method (default optimizer)</a>\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization\">Newton-Raphson</a></td>\n",
    "<td><a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method (default optimizer)</a>\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Conjugate_gradient_method\">Conjugate Gradient</a></td>\n",
    "<td>NA</td>\n",
    "</tr>\n",
    "\n",
    "\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='linesearch'>Line Search Methods Used in LBFGS Optimizer(for advanced users)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr>\n",
    "<th>Line Search Method in Limited-memory BFGS Optimizer</th>\n",
    "<th>Inference Method</th>\n",
    "<th>Value in <a href=\"http://www.shogun-toolbox.org/doc/en/latest/lbfgscommon_8h.html\">the ShogunAPI</a></th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.660\">MoreThuente method</a></td>\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a>,\n",
    "(For now, **NOT** support <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method</a>)\n",
    "</td>\n",
    "<td>0</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Wolfe_conditions\">Backtracking method with the Armijo condition</a></td>\n",
    "\n",
    "\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a>\n",
    "</td>\n",
    "<td>1</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td><a href=\"http://en.wikipedia.org/wiki/Wolfe_conditions\">Backtracking method with the (regular or strong) Wolfe condition</a></td>\n",
    "\n",
    "<td>\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSingleLaplaceInferenceMethod.html\">Laplace Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCovarianceInferenceMethod.html\">Covariance Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDiagonalInferenceMethod.html\">Mean-field Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLDualInferenceMethod.html\">Dual Variational Method</a>,\n",
    "<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKLCholeskyInferenceMethod.html\">Cholesky Variational Method</a>\n",
    "</td>\n",
    "<td>2 (regular),3 (strong, default line search method)</td>\n",
    "\n",
    "</tr>\n",
    "\n",
    "\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='toy'>A Toy Example of Variational Gaussian Inference</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we present a 2-D example for GP binary classification using variational Gaussian inference. The gaussian prior and the likelihood are shown as below. For this simple example, we can numerically compute the unnormalized true posterior, shown in the following figure, using the Bayes rule: $p(\\mathbf{f}|\\mathbf{y}) \\propto p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f})$, where $p(\\mathbf{f})$ is the prior and $p(\\mathbf{y}|\\mathbf{f}))$ is the likelihood. The approximated posterior obtained using various methods are also given.\n",
    "The figure can be obtained by using the following code. Note that this example is closely related to the example at page 7-8 of the paper, <a href=\"http://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf\">Approximations for binary Gaussian process classification</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def Gaussian_points(Sigma,mu,xmin,xmax,ymin,ymax,delta=0.01):\n",
    "    \"\"\"\n",
    "    This function is used to evaluate the likelihood of a Gaussian distribution on mesh.\n",
    "    \n",
    "    Parameters:\n",
    "         Sigma - covariance of Gaussian\n",
    "         mu - mean of Gaussian\n",
    "         xmin, xmax, ymin, ymax, delta - defines mesh\n",
    "         \n",
    "    Returns:\n",
    "    X, Y, Z, where Z = log(p(X,Y)) and p is a bivariate gaussian (mu, Sigma) evaluated at a mesh X,Y\n",
    "    \"\"\"\n",
    "\n",
    "    xlist = np.arange(xmin, xmax, delta)\n",
    "    ylist = np.arange(ymin, ymax, delta)\n",
    "    X, Y = np.meshgrid(xlist, ylist)\n",
    "    model = GaussianDistribution(mu, Sigma)\n",
    "    Z = np.zeros(len(X)*len(Y))\n",
    "    idx = 0\n",
    "    for items in zip(X,Y):\n",
    "        for sample in zip(items[0],items[1]):\n",
    "            sample = np.asarray(sample)\n",
    "            Z[idx] = model.log_pdf(sample)\n",
    "            idx += 1\n",
    "    Z = np.asarray(Z).reshape(len(X),len(Y))\n",
    "    return (X,Y,Z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def likelihood_points(X,Y,labels,likelihood):\n",
    "    \"\"\"\n",
    "    This function is used to evaluate a given likelihood function on a given mesh of points\n",
    "    \n",
    "    Parameters:\n",
    "        X, Y - coordiates of the mesh\n",
    "        labels - labels for the likelihood\n",
    "        likelihood - likelihood function\n",
    "        \n",
    "    Returns:\n",
    "        Z - log(p(X,Y,labels)), where p comes from likelihood\n",
    "    \"\"\"\n",
    "    \n",
    "    Z = np.zeros(len(X)*len(Y))    \n",
    "    idx = 0\n",
    "    for items in zip(X,Y):\n",
    "        for sample in zip(items[0],items[1]):\n",
    "            sample = np.asarray(sample)\n",
    "            lpdf = likelihood.get_log_probability_f(labels, sample).sum()\n",
    "            Z[idx] = lpdf\n",
    "            idx += 1\n",
    "    Z = np.asarray(Z).reshape(len(X),len(Y))\n",
    "    return Z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def approx_posterior_plot(methods, kernel_func, features, mean_func, \n",
    "                          labels, likelihoods, kernel_log_scale, \n",
    "                          xmin, xmax, ymin, ymax, delta, plots):\n",
    "    \"\"\"\n",
    "    This function is used to generate points drawn from approximated posterior and plot them\n",
    "        \n",
    "    Parameters:\n",
    "        methods - a list of methods used to approximate the posterior\n",
    "        kernel_func - a covariance function for GP\n",
    "        features - X\n",
    "        mean_func -  mean function for GP\n",
    "        labels - Y\n",
    "        likelihood- a data likelihood to model labels\n",
    "        kernel_log_scale - a hyper-parameter of covariance function\n",
    "        xmin, ymin, xmax, ymax, delta - used for generating points from an approximated posterior\n",
    "        plots - subplot\n",
    "    Returns:\n",
    "        Nothing\n",
    "    \"\"\"\n",
    "    \n",
    "    (rows, cols) = plots.shape\n",
    "    methods = np.asarray(methods).reshape(rows, cols)\n",
    "    likelihoods = np.asarray(likelihoods).reshape(rows, cols)\n",
    "    for r in range(rows):\n",
    "        for c in range(cols):\n",
    "            inference = methods[r][c]\n",
    "            likelihood = likelihoods[r][c]\n",
    "            inf = inference(kernel_func, features, mean_func, labels, likelihood())\n",
    "            inf.set_scale(exp(kernel_log_scale))\n",
    "            #get the approximated Gaussian distribution\n",
    "            mu = inf.get_posterior_mean()\n",
    "            Sigma = inf.get_posterior_covariance()\n",
    "            #normalized approximated posterior\n",
    "            (X,Y,Z) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)\n",
    "            CS = plots[r][c].contour(X, Y, np.exp(Z))\n",
    "            plots[r][c].set_title('posterior via %s'%inf.get_name())\n",
    "            plots[r][c].axis('equal')\n",
    "            plots[r][c].clabel(CS,inline=1,fontsize=10)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#a toy 2D example (data)\n",
    "x=np.asarray([sqrt(2),-sqrt(2)]).reshape(1,2)\n",
    "y=np.asarray([1,-1])\n",
    "\n",
    "features = RealFeatures(x)\n",
    "labels = BinaryLabels(y)\n",
    "kernel_log_sigma = 1.0\n",
    "kernel_log_scale = 1.5\n",
    "\n",
    "#a mean function and a covariance function for GP\n",
    "mean_func = ConstMean()\n",
    "#using log_sigma as a hyper-parameter of GP instead of sigma\n",
    "kernel_sigma = 2*exp(2*kernel_log_sigma)\n",
    "kernel_func = GaussianKernel(10, kernel_sigma)\n",
    "kernel_func.init(features, features)\n",
    "#a prior distribution derived from GP via applying the mean function and the covariance function to data\n",
    "Sigma = kernel_func.get_kernel_matrix()\n",
    "Sigma = Sigma * exp(2.0*kernel_log_scale)\n",
    "mu = mean_func.get_mean_vector(features)\n",
    "\n",
    "delta = 0.1\n",
    "xmin = -4\n",
    "xmax = 6\n",
    "ymin = -6\n",
    "ymax = 4\n",
    "\n",
    "#a prior (Gaussian) derived from GP\n",
    "(X,Y,Z1) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)\n",
    "\n",
    "col_size = 6\n",
    "f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(col_size*3,col_size))\n",
    "\n",
    "CS1=ax1.contour(X, Y, np.exp(Z1))\n",
    "ax1.set_title('Prior')\n",
    "ax1.axis('equal')\n",
    "\n",
    "ax1.clabel(CS1,\n",
    "           inline=1,\n",
    "           fontsize=10)\n",
    "\n",
    "#likelihood (inverse logit, A.K.A. Bernoulli-logistic)\n",
    "#likelihoods classes for inference methods  (see the likelihood table)\n",
    "likelihoods=[\n",
    "LogitLikelihood,\n",
    "LogitVGLikelihood,\n",
    "#LogitVGLikelihood,\n",
    "LogitVGLikelihood,\n",
    "LogitDVGLikelihood\n",
    "]\n",
    "#likelihood\n",
    "Z2 = likelihood_points(X,Y,labels,LogitLikelihood())\n",
    "CS2 = ax2.contour(X, Y, np.exp(Z2))\n",
    "ax2.set_title('Likelihood')\n",
    "ax2.axis('equal')\n",
    "ax2.clabel(CS2,\n",
    "           inline=1,\n",
    "           fontsize=10)\n",
    "\n",
    "#a unnormalized true posterior (non-Gaussian)\n",
    "Z3 = Z1+Z2\n",
    "CS3 = ax3.contour(X, Y, np.exp(Z3))\n",
    "ax3.set_title('True posterior')\n",
    "ax3.axis('equal')\n",
    "ax3.clabel(CS3,\n",
    "           inline=1,\n",
    "           fontsize=10)\n",
    "\n",
    "f, plots =plt.subplots(2, 2, figsize=(col_size*3,col_size*2))\n",
    "\n",
    "#Inference methods used to approximate a posterior (see the table below)\n",
    "methods=[\n",
    "SingleLaplaceInferenceMethod,\n",
    "KLDiagonalInferenceMethod,\n",
    "#KLCovarianceInferenceMethod,\n",
    "KLCholeskyInferenceMethod,\n",
    "KLDualInferenceMethod\n",
    "]\n",
    "#posterior\n",
    "approx_posterior_plot(methods, kernel_func, features, mean_func, labels, likelihoods, kernel_log_scale, \n",
    "                      xmin, xmax, ymin, ymax, delta, plots)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark:\n",
    "The true posterior is a non-Gaussian distribution and the normalized constant usually is difficult to compute.\n",
    "All approximated posteriors are Gaussian distributions. A posterior Gaussian distribution can efficiently perform inference and predictions.\n",
    "The Laplace approximation is not ideal in term of approximating a posterior and making predictions although it is fast.\n",
    "With the price of speed, variational methods such as the KLCovariance and the KLCholesky offer more accurate approximations compared than the Laplace method. The KLDual method also can have a good approximation and in many cases are better than the Laplace method. On the other hand, the KLApproxDiagonal (mean-field) method is as fast as the Laplace method and for certain datasets, the KLApproxDiagonal method is more accurate than Laplace method in term of making predictions. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id='visu'>An Example for Visualization (banana dataset)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the banana data set to demonstrate decision boundary of GPC. The data set can be found at <a href=\"http://sci2s.ugr.es/keel/dataset.php?cod=182\">here</a>. This is a binary classification problem. The goal of this example is to visually show the decision boundary of GPC models. The following is the description of the data set.\n",
    "\n",
    "It is an artificial data set where instances belongs to several clusters with a banana shape. There are two attributes At1 and At2 corresponding to the x and y axis, respectively. The class label (-1 and 1) represents one of the two banana shapes in the dataset.\n",
    "\n",
    "We now define code to train a GP classifier using Shogun."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def train_small_scale(inference, linesearch, likelihood, x_train, y_train, kernel_log_sigma, kernel_log_scale):\n",
    "    \n",
    "    \"\"\"\n",
    "    This function is used to train a GP classifer given an inference method\n",
    "        \n",
    "    Parameters:\n",
    "        inference - an inference methods used to train the classifer\n",
    "        linesearch - a linesearch method used in the inference method \n",
    "        likelihood - likelihood function to model data\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train - labels for training\n",
    "        kernel_log_sigma, kernel_log_scale hyper-parameters of covariance function    \n",
    "    Returns:\n",
    "        trained GPC model, name of inference method\n",
    "    \"\"\"\n",
    "\n",
    "    mean_func = ZeroMean()\n",
    "    kernel_sigma = 2*exp(2*kernel_log_sigma);\n",
    "    kernel_func = GaussianKernel(10, kernel_sigma)\n",
    "\n",
    "    #Y is a sample-by-1 vector\n",
    "    labels_train = BinaryLabels(y_train)\n",
    "    #X is a feature-by-sample matrix\n",
    "    features_train=RealFeatures(x_train)\n",
    "\n",
    "    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)\n",
    "    inf.set_scale(exp(kernel_log_scale))\n",
    "\n",
    "    gp = GaussianProcessClassification(inf)\n",
    "    gp.train()\n",
    "\n",
    "    return gp, inf.get_name()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def extract_banana_dataset(input_path):\n",
    "    \"\"\"\n",
    "    This function is used to process raw data of the banana data set\n",
    "        \n",
    "    Parameters:\n",
    "        inf - the path of the raw data\n",
    "    Returns:\n",
    "        x_train, x_test - features for training and testing\n",
    "        y_train, y_test - labels for training and testing\n",
    "        x1s, x2s  -for ploting training or testing points\n",
    "    \"\"\"\n",
    "\n",
    "    random.seed(1)\n",
    "    x=[[], []]\n",
    "    y=[[], []]\n",
    "    x1s=[{}, {}]\n",
    "    x2s=[{}, {}]\n",
    "\n",
    "    for line in open(input_path):\n",
    "        line=line.strip()\n",
    "        if line.startswith(\"@\"):\n",
    "            continue\n",
    "        choice=random.randint(0,1)\n",
    "        if choice==0:\n",
    "            #testing set\n",
    "            idx=1\n",
    "        else:\n",
    "            #training set\n",
    "            idx=0\n",
    "        info=list(map(float,line.split(',')))\n",
    "        x[idx].append(info[:-1])\n",
    "        y[idx].append(info[-1])\n",
    "        if info[-1]>0:\n",
    "            #positive cases\n",
    "            container_x1=x1s[idx].setdefault(1,[])\n",
    "            container_x2=x2s[idx].setdefault(1,[])\n",
    "        else:\n",
    "            #negative cases\n",
    "            container_x1=x1s[idx].setdefault(0,[])\n",
    "            container_x2=x2s[idx].setdefault(0,[])\n",
    "        container_x1.append(info[0])\n",
    "        container_x2.append(info[1])\n",
    "\n",
    "    x_train = np.asarray(x[0]).T\n",
    "    y_train = np.asarray(y[0])\n",
    "    x_test = np.asarray(x[1]).T\n",
    "    y_test = np.asarray(y[1])\n",
    "\n",
    "    return x_train, y_train, x_test, y_test, x1s, x2s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "def plot_small_scale(input_path):\n",
    "    \"\"\"\n",
    "    This function is used to print the decision bounday and testing points\n",
    "        \n",
    "    Parameters:\n",
    "        inf - the path of the raw data\n",
    "    \"\"\"\n",
    "    #obtain data points with labels from raw data\n",
    "    (x_train, y_train, x_test, y_test, x1s, x2s)=extract_banana_dataset(input_path)\n",
    "    \n",
    "    #binary classification problem (positive labels and negative labels)\n",
    "    inference =SingleLaplaceInferenceMethod\n",
    "    likelihood = LogitLikelihood()\n",
    "    linesearch = 3\n",
    "    \n",
    "    #we show how parameters of GPC affect the decision boundary\n",
    "    #over-fit, right-fit, under-fit\n",
    "    kernel_log_sigmas=[-5,0,1]\n",
    "    kernel_log_scale=0 #we fix the scale parameter\n",
    "    \n",
    "    #plot setting\n",
    "    col_size=8\n",
    "    f, plots =plt.subplots(len(kernel_log_sigmas),2,figsize=(col_size*2,col_size*len(kernel_log_sigmas)))\n",
    "    \n",
    "    #points used to generate decision boundary\n",
    "    n_boundary=50\n",
    "    x1_boundary = np.linspace(x_train[0,:].min()-1, x_train[0,:].max()+1, n_boundary)\n",
    "    x2_boundary = np.linspace(x_train[1,:].min()-1, x_train[1,:].max()+1, n_boundary)\n",
    "    x_boundary = np.asarray(list(product(x1_boundary, x2_boundary))).T\n",
    "\n",
    "    features_boundary=RealFeatures(x_boundary)\n",
    "    \n",
    "    for idx,kernel_log_sigma in enumerate(kernel_log_sigmas):\n",
    "        #train a GPC model given traning data points\n",
    "        (gpc, name)=train_small_scale(inference, linesearch, likelihood, x_train, y_train, kernel_log_sigma, kernel_log_scale)\n",
    "        #obtain the probabilities of being positive label(s) given new data point(s)\n",
    "        prbs=gpc.get_probabilities(features_boundary)\n",
    "\n",
    "        for choice in [0,1]:\n",
    "             #decision boundary\n",
    "             plots[idx][choice].contour(x1_boundary, x2_boundary, np.reshape(prbs, (n_boundary, n_boundary)).T, levels=[0.5], colors=('blue'))  \n",
    "             #training points or testing points with true positive tag\n",
    "             plots[idx][choice].scatter(x1s[choice][1],x2s[choice][1], c='red', alpha=0.5)\n",
    "             #training points or testing points with true negative tag\n",
    "             plots[idx][choice].scatter(x1s[choice][0],x2s[choice][0], c='yellow', alpha=0.5)\n",
    "             plots[idx][choice].set_xlabel(\"At1\")\n",
    "             plots[idx][choice].set_ylabel(\"At2\")\n",
    "             plots[idx][choice].axis(\"equal\")\n",
    "             type_name=\"training\"\n",
    "             if choice==1:\n",
    "                type_name=\"testing\"\n",
    "             if idx==0:\n",
    "                fit_condition=\"over-fit\"\n",
    "             elif idx==1:\n",
    "                fit_condition=\"right-fit\"\n",
    "             else:\n",
    "                fit_condition=\"under-fit\"                           \n",
    "             plots[idx][choice].set_title(\"Decision boundary (blue) \\n of %s on %s points (%s)\"%(name, type_name,fit_condition))\n",
    "\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "input_path=os.path.join(SHOGUN_DATA_DIR, 'toy/banana.dat')\n",
    "plot_small_scale(input_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id=\"largescale\">A Glimpse of Large Scale Inference Methods</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"difficulty\">Difficulty of Scaling up Standard GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Readers may know that a GPC model is a <a href=\"http://en.wikipedia.org/wiki/Nonparametric_statistics#Non-parametric_models\">non-parametric model</a>. Indeed, a  GPC model is memory based and dependent on training data and summarized through the approximated posterior distribution, $q(\\mathbf{f}|\\mathbf{y})$, (a Gaussian distribution for variational Gaussian inference). In variational Gaussian inference for a **standard** GPC model, we must store an amount of information that increases with the number of training data, $n$. Recall that in variational Gaussian inference, we need to store mean (n-by-1) vector, $\\mu$, and covariance (n-by-n) matrix, $\\Sigma$, of $q(\\mathbf{f}|\\mathbf{y})$. We need to perform matrix inversion (or matrix factorization) related to the covariance matrix in the training process. In general, the time complexity of such matrix inversion is O($n^3$)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"sparseGP\">Big Picture of Sparse GPC Models for Large Scale Inference</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of <a href=\"http://www.jmlr.org/papers/v6/quinonero-candela05a.html\">sparse GP models</a> is to approximate the covariance matrix of prior distribution, $p(\\mathbf{f})$ with a <a href=\"http://en.wikipedia.org/wiki/Positive-definite_matrix\">positive definite matrix</a> that can reduce the time complexity in the training phrase.\n",
    "\n",
    "We pay attention to <a href=\"http://www.gatsby.ucl.ac.uk/~snelson/SPGP_up.pdf\">the inducing point</a> approach, where a set of inducing data, $\\mathbf{S}$, is used in the training process. We define $\\mathbf{S}\\in\\mathcal{R}^{m\\times d}$, where $m$ is the number of inducing points and $d$ is the number of features.  Under the assumption of <a href=\"http://papers.nips.cc/paper/3351-the-generalized-fitc-approximation.pdf\">Fully Independent Training Conditional</a> (FITC), the inducing point approach uses a \"diagonal plus low rank\" matrix to approximate the covariance matrix of the prior distribution. In the inducing point approach with FITC assumption, we can reduce the time complexity related to the covariance matrix of $q(\\mathbf{f}|\\mathbf{y})$ in variational Gaussian inference from O($n^3$) to O($n\\times m^2$) using the <a href=\"http://en.wikipedia.org/wiki/Binomial_inverse_theorem\">Binomial Inverse Theorem</a>.\n",
    "Note that $m \\le n$, where $n$ is the number of training points. What is more, when $\\mathbf{S}= \\mathbf{X}$, the sparse GPC model becomes a standard GPC model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will cover more detail about large scale inference methods for GPC models in a new notebook.\n",
    "Here, we present a list of large scale inference methods to be implemented in Shogun.\n",
    "\n",
    "<table>\n",
    "<tr>\n",
    "<th>Large Scale Inference Methods based on Sparse GPC Models</th>\n",
    "<th>Main idea</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>FITC Laplace Inference <a href=\"http://www.jmlr.org/papers/v6/quinonero-candela05a.html\">[1]</a> <a href=\"http://papers.nips.cc/paper/3351-the-generalized-fitc-approximation.pdf\">[2]</a> </td>\n",
    "<td>Conditional independence related to inducing data points, Laplace approximation</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>Parallel/Distributed Variational Inference<a href=\"http://arxiv.org/abs/1402.1389\">[1]</a>  <a href=\"http://arxiv.org/abs/1402.1412\">[2]</a></td>\n",
    "<td>Conditional independence related to inducing data points, Variational Gaussian inference\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>Stochastic Variational Inference<a href=\"http://auai.org/uai2013/prints/papers/244.pdf\">[1]</a> <a href=\"http://jmlr.org/proceedings/papers/v38/hensman15.pdf\">[2]</a></td>\n",
    "<td>Conditional independence related to inducing data points, Stochastic natural gradient, Variational Gaussian inference</td>\n",
    "</tr>\n",
    "\n",
    "\n",
    "<tr>\n",
    "<td>Nonparametric Variational Inference<a href=\"http://machinelearning.wustl.edu/mlpapers/papers/ICML2012Gershman_360\">[1]</a> <a href=\"http://papers.nips.cc/paper/5374-a-safe-screening-rule-for-sparse-logistic-regression\">[2]</a></td>\n",
    "<td>Conditional independence related to inducing data points, Nonparametric Variational inference ($q(\\cdot)$ is Mixture of Gaussians)\n",
    "</td>\n",
    "</tr>\n",
    "\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def train_large_scale(inference, linesearch, likelihood, x_train, y_train, x_inducing, kernel_log_sigma, kernel_log_scale, optimizing_inducing_points):\n",
    "    \n",
    "    \"\"\"\n",
    "    This function is used to train a GP classifer given an inference method\n",
    "        \n",
    "    Parameters:\n",
    "        inference - an inference methods used to train the classifer\n",
    "        linesearch - a linesearch method used in the inference method \n",
    "        likelihood - likelihood function to model data\n",
    "        x_train, x_test - features for training and testing\n",
    "        y_train - labels for training\n",
    "        kernel_log_sigma, kernel_log_scale hyper-parameters of covariance function    \n",
    "        optimizing_inducing_features - whether we optimize inducing features\n",
    "    Returns:\n",
    "        trained GPC model, name of inference method\n",
    "    \"\"\"\n",
    "\n",
    "    mean_func = ZeroMean()\n",
    "    kernel_sigma = exp(kernel_log_sigma);\n",
    "    kernel_func = GaussianARDSparseKernel(10)\n",
    "    kernel_func.set_scalar_weights(1.0/kernel_sigma)\n",
    "    \n",
    "\n",
    "    #Y is a sample-by-1 vector\n",
    "    labels_train = BinaryLabels(y_train)\n",
    "    #X is a feature-by-sample matrix\n",
    "    features_train=RealFeatures(x_train)\n",
    "    features_inducing=RealFeatures(x_inducing)\n",
    "    \n",
    "    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood, features_inducing)\n",
    "    inf.set_scale(exp(kernel_log_scale))\n",
    "    if optimizing_inducing_points:\n",
    "        inf.enable_optimizing_inducing_features(True, LBFGSMinimizer())\n",
    "        inf.set_tolearance_for_inducing_features(1e-3)\n",
    "        inf.set_max_iterations_for_inducing_features(20)\n",
    "    try:\n",
    "        #setting parameters for inducing points\n",
    "        inf.set_inducing_noise(1e-6);\n",
    "    except:\n",
    "        pass\n",
    "    gp = GaussianProcessClassification(inf)\n",
    "    gp.train()\n",
    "    return gp, inf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def plot_helper(plots,x1_boundary,x2_boundary,x_prbs,n_boundary,name,x1s,x2s,info=\"\"):   \n",
    "    \"\"\"\n",
    "    This helper function is used for plot decision boundary\n",
    "        \n",
    "    Parameters:\n",
    "        plots - a list of plots\n",
    "        x1_boundary, x2_boundary -features of boundary points used for determining decision boundary\n",
    "        x_prbs -probabilities of being positive labels of the boundary points\n",
    "        n_boundary -number of boundary points\n",
    "        name -name of inference method\n",
    "        x1s, x2s - training and testing points (different from boundary points)\n",
    "        info - string used in plot titles\n",
    "    \"\"\"\n",
    "        \n",
    "    for choice in [0,1]:\n",
    "        #decision boundary\n",
    "        plots[choice].contour(x1_boundary, x2_boundary, np.reshape(x_prbs, (n_boundary, n_boundary)).T, levels=[0.5], colors=('blue')) \n",
    "       \n",
    "        \n",
    "        #training points or testing points with true positive tag\n",
    "        plots[choice].scatter(x1s[choice][1],x2s[choice][1], c='red', alpha=0.5)\n",
    "        #training points or testing points with true negative tag\n",
    "        plots[choice].scatter(x1s[choice][0],x2s[choice][0], c='yellow', alpha=0.5)\n",
    "        plots[choice].set_xlabel(\"At1\")\n",
    "        plots[choice].set_ylabel(\"At2\")\n",
    "        plots[choice].axis(\"equal\")\n",
    "        type_name=\"training\"\n",
    "        if choice==1:\n",
    "            type_name=\"testing\"                         \n",
    "        plots[choice].set_title(\"Decision boundary (blue) \\n of %s\\n%s on %s points\"%(name, info, type_name))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "def plot_large_scale(input_path):\n",
    "    \"\"\"\n",
    "    This function is used to plot the decision bounday and training and testing points\n",
    "        \n",
    "    Parameters:\n",
    "        input_path - the path of the raw data\n",
    "    \"\"\"\n",
    "    #obtain data points with labels\n",
    "    (x_train, y_train, x_test, y_test, x1s, x2s)=extract_banana_dataset(input_path)\n",
    "    print(\"%d training points\"%(len(x_train[0])))\n",
    "    #we want to compare two inference methods\n",
    "    inferences =[\n",
    "                SingleFITCLaplaceInferenceMethod, #inference method for sparse Gaussian processes\n",
    "                SingleLaplaceInferenceMethod,    #inference method for full Gaussian processes\n",
    "                ]\n",
    "    likelihood = LogitLikelihood()\n",
    "    linesearch = 3\n",
    "    kernel_log_sigma=0\n",
    "    kernel_log_scale=0\n",
    "    \n",
    "    #plot setting\n",
    "    col_size=8\n",
    "    f, plots =plt.subplots(2,2,figsize=(col_size*2,col_size*2))\n",
    "    \n",
    "    #generating boundary points\n",
    "    #these boundary points are used to generate decision boundaries\n",
    "    n_boundary=20\n",
    "    x1_boundary =np.linspace(x_train[0,:].min(), x_train[0,:].max(), n_boundary)\n",
    "    x2_boundary =np.linspace(x_train[1,:].min(), x_train[1,:].max(), n_boundary)\n",
    "    x_boundary = np.asarray(list(product(x1_boundary, x2_boundary))).T\n",
    "\n",
    "    features_boundary=RealFeatures(x_boundary)\n",
    "\n",
    "    #generate inducing points for sparse Gaussian process models\n",
    "    #note that we use a few inducing points instead of all training points\n",
    "    n_inducing=20  \n",
    "    x1_inducing = np.random.rand(n_inducing)*(x_train[0,:].max()-x_train[0,:].min())+x_train[0,:].min()\n",
    "    x2_inducing = np.random.rand(n_inducing)*(x_train[1,:].max()-x_train[1,:].min())+x_train[1,:].min()\n",
    "    x_inducing=np.row_stack([x1_inducing,x2_inducing])\n",
    "    print(\"%d inducing points\"%(n_inducing))\n",
    "    #for measuring runtime\n",
    "    import time\n",
    "    start=time.time()\n",
    "    #FITC Laplace approximation (inference method for sparse Gaussian processes)\n",
    "    (gpc, inf)=train_large_scale(inferences[0], linesearch, likelihood, x_train, y_train, x_inducing, kernel_log_sigma, kernel_log_scale, False)\n",
    "    name=inf.get_name()\n",
    "    prbs=gpc.get_probabilities(features_boundary)\n",
    "    print(\"FITC Laplace inference took %.2f seconds\" % (time.time()-start))\n",
    "    plot_helper(plots[0],x1_boundary,x2_boundary,prbs,n_boundary,name,x1s,x2s,\"with %d unoptimized inducing points\"%n_inducing)\n",
    "    \n",
    "    #plot the inducing points used in sparse Gaussian process models\n",
    "    plots[0][0].scatter(x1_inducing, x2_inducing, marker=(5,1), c='green',s=60)\n",
    "    plots[0][1].scatter(x1_inducing, x2_inducing, marker=(5,1), c='green',s=60)\n",
    "    \n",
    "    plots[0][0].legend([\"positive points\",\"negative points\", \"inducing points\"],scatterpoints=1)\n",
    "    plots[0][1].legend([\"positive points\",\"negative points\", \"inducing points\"],scatterpoints=1)\n",
    "    \n",
    "    start=time.time()\n",
    "    #Laplace approximation (inference method for full Gaussian processes)\n",
    "    (gpc, name)=train_small_scale(inferences[1], linesearch, likelihood, x_train, y_train, kernel_log_sigma, kernel_log_scale)\n",
    "    prbs=gpc.get_probabilities(features_boundary)\n",
    "    print(\"Laplace inference took %.2f seconds\" % (time.time()-start))\n",
    "    plot_helper(plots[1],x1_boundary,x2_boundary,prbs,n_boundary,name,x1s,x2s)\n",
    "\n",
    "    plots[1][0].legend([\"positive points\",\"negative points\"],scatterpoints=1)\n",
    "    plots[1][1].legend([\"positive points\",\"negative points\"],scatterpoints=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Since inducing points are generated by random, decision boundary of SingleFITCLaplace could be different if random seed is different\n",
    "np.random.seed(8) \n",
    "input_path=os.path.join(SHOGUN_DATA_DIR, 'toy/banana.dat')\n",
    "plot_large_scale(input_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark: SingleFITCLaplacian is the implementation of the FITC Laplace inference method in Shogun. We can make the following observations:\n",
    "* FITC Laplace inference method improves the efficiency in the training phrase.\n",
    "* The loss of accuracy in prediction may be acceptable given the huge improvement in time.\n",
    "* Good inducing points plays a significant role in accuracy.\n",
    "* Selecting reasonable inducing points may be not trivial.\n",
    "\n",
    "Indeed, inducing points can be a subset of training points or can be obtained from optimizing some criterion (eg, marginal approximated log likelihood). We will cover this in the following section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"optInd\">Optimizing Inducing Points</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "def plot_large_scale_with_inducing_points(input_path,n_inducing):\n",
    "    \"\"\"\n",
    "    This function is used to plot the decision bounday and training and testing points\n",
    "        \n",
    "    Parameters:\n",
    "        input_path - the path of the raw data\n",
    "        n_inducing - number of inducing points\n",
    "    \"\"\"\n",
    "    #obtain data points with labels\n",
    "    (x_train, y_train, x_test, y_test, x1s, x2s)=extract_banana_dataset(input_path)\n",
    "    print(\"%d training points\"%(len(x_train[0])))\n",
    "    #we want to compare two inference methods\n",
    "    inferences =[\n",
    "                SingleFITCLaplaceInferenceMethod, #inference method for sparse Gaussian processes\n",
    "                ]\n",
    "    likelihood = LogitLikelihood()\n",
    "    linesearch = 3\n",
    "    kernel_log_sigma=0\n",
    "    kernel_log_scale=0\n",
    "    \n",
    "    #plot setting\n",
    "    col_size=8\n",
    "    f, plots =plt.subplots(2,2,figsize=(col_size*2,col_size*2))\n",
    "    \n",
    "    #generating boundary points\n",
    "    #these boundary points are used to generate decision boundaries\n",
    "    n_boundary=20\n",
    "    x1_boundary =np.linspace(x_train[0,:].min(), x_train[0,:].max(), n_boundary)\n",
    "    x2_boundary =np.linspace(x_train[1,:].min(), x_train[1,:].max(), n_boundary)\n",
    "    x_boundary = np.asarray(list(product(x1_boundary, x2_boundary))).T\n",
    "\n",
    "    features_boundary=RealFeatures(x_boundary)\n",
    "\n",
    "    #generate inducing points for sparse Gaussian process models\n",
    "    #note that we use a few inducing points instead of all training points\n",
    "    x1_inducing = np.random.rand(n_inducing)*(x_train[0,:].max()-x_train[0,:].min())+x_train[0,:].min()\n",
    "    x2_inducing = np.random.rand(n_inducing)*(x_train[1,:].max()-x_train[1,:].min())+x_train[1,:].min()\n",
    "    x_inducing=np.row_stack([x1_inducing,x2_inducing])\n",
    "    print(\"%d inducing points\"%(n_inducing))\n",
    "\n",
    "    #FITC Laplace approximation (inference method for sparse Gaussian processes)\n",
    "    (gpc, inf)=train_large_scale(inferences[0], linesearch, likelihood, x_train, y_train, x_inducing, kernel_log_sigma, kernel_log_scale, False)\n",
    "    name=inf.get_name()\n",
    "    prbs=gpc.get_probabilities(features_boundary)\n",
    "    plot_helper(plots[0],x1_boundary,x2_boundary,prbs,n_boundary,name,x1s,x2s,\"without optimizing %d inducing points\"%n_inducing) \n",
    "    x_inducing=inf.get_inducing_features().get_computed_dot_feature_matrix()\n",
    "\n",
    "\n",
    "    #plot the inducing points used in sparse Gaussian process models without optimizing inducing points\n",
    "    plots[0][0].scatter(x_inducing[0], x_inducing[1], marker=(5,1), c='green',s=60)\n",
    "    plots[0][1].scatter(x_inducing[0], x_inducing[1], marker=(5,1), c='green',s=60)\n",
    "    \n",
    "    plots[0][0].legend([\"positive points\",\"negative points\", \"raw inducing points\"],scatterpoints=1)\n",
    "    plots[0][1].legend([\"positive points\",\"negative points\", \"raw inducing points\"],scatterpoints=1)\n",
    "    \n",
    "    \n",
    "    #plot the inducing points used in sparse Gaussian process models with optimizing inducing points\n",
    "    (gpc, inf)=train_large_scale(inferences[0], linesearch, likelihood, x_train, y_train, x_inducing, kernel_log_sigma, kernel_log_scale, True)\n",
    "    name=inf.get_name()\n",
    "    prbs=gpc.get_probabilities(features_boundary)\n",
    "    plot_helper(plots[1],x1_boundary,x2_boundary,prbs,n_boundary,name,x1s,x2s, \"with optimized %d inducing points\"%n_inducing)\n",
    "    x_inducing=inf.get_inducing_features().get_computed_dot_feature_matrix()  \n",
    "    \n",
    "    plots[1][0].scatter(x_inducing[0], x_inducing[1], marker=(5,1), c='green',s=60)\n",
    "    plots[1][1].scatter(x_inducing[0], x_inducing[1], marker=(5,1), c='green',s=60)\n",
    "    \n",
    "    plots[1][0].legend([\"positive points\",\"negative points\", \"optimized inducing points\"],scatterpoints=1)\n",
    "    plots[1][1].legend([\"positive points\",\"negative points\", \"optimized inducing points\"],scatterpoints=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Since inducing points are generated by random, decision boundary of SingleFITCLaplace could be different if random seed is different\n",
    "np.random.seed(3) \n",
    "input_path=os.path.join(SHOGUN_DATA_DIR, 'toy/banana.dat')\n",
    "n_inducing_points=8\n",
    "plot_large_scale_with_inducing_points(input_path,n_inducing_points)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id=\"realeg1\">A Real-World Example (sonar dataset)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We train GP classifiers for the sonar data set. The data set can be found at <a href=\"https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/\">here</a>. This is a binary classification problem. The goal of this example is to compare all inference methods of full Gaussian processes implemented in shogun in term of making prediction predictions. This example is about reproducing the example at page 29 of the paper, <a href=\"http://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf\">Approximations for binary Gaussian process classification</a>. Please refer the paper about the definition of the information bit measure. The following is the description of the data set.\n",
    "\n",
    "The class \"sonar.mines\" contains 111 data points  obtained by bouncing sonar\n",
    "signals off a metal cylinder at various angles and under various\n",
    "conditions.  The class \"sonar.rocks\" contains 97 data points obtained from\n",
    "rocks under similar conditions.  The transmitted sonar signal is a\n",
    "frequency-modulated chirp, rising in frequency.  The data set contains\n",
    "signals obtained from a variety of different aspect angles, spanning 90\n",
    "degrees for the cylinder and 180 degrees for the rock.\n",
    "Each data point is a set of 60 features in the range 0.0 to 1.0.  Each feature\n",
    "represents the energy within a particular frequency band, integrated over\n",
    "a certain period of time.  The integration aperture for higher frequencies\n",
    "occur later in time, since these frequencies are transmitted later during\n",
    "the chirp.\n",
    "The label associated with each record contains the letter \"R\" if the object\n",
    "is a rock and \"M\" if it is a mine (metal cylinder).  The features in the\n",
    "labels are in increasing order of aspect angle, but they do not encode the\n",
    "angle directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def learning_example1(inference, minimizer, linesearch, likelihood, x_train, y_train, x_test, y_test, kernel_log_sigma, kernel_log_scale):\n",
    "    \n",
    "    \"\"\"\n",
    "    This function is used to train a GP classifer given an inference method\n",
    "        \n",
    "    Parameters:\n",
    "        inference - an inference methods used to train the classifer\n",
    "        minimizer - minimizer used in inference\n",
    "        linesearch - line search used in minimizer\n",
    "        likelihood - likelihood function to model data\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train, y_test - Y for training and testing\n",
    "        kernel_log_sigma, kernel_log_scale hyper-parameters of covariance function    \n",
    "    Returns:\n",
    "        predictive result of the testing data set, name of inference method\n",
    "    \"\"\"\n",
    "\n",
    "    mean_func = ZeroMean()\n",
    "    kernel_sigma = 2*exp(2*kernel_log_sigma);\n",
    "    kernel_func = GaussianKernel(10, kernel_sigma)\n",
    "\n",
    "    #Y is a sample-by-1 vector\n",
    "    labels_train = BinaryLabels(y_train)\n",
    "    labels_test = BinaryLabels(y_test)\n",
    "    #X is a feature-by-sample matrix\n",
    "    features_train=RealFeatures(x_train)\n",
    "    features_test=RealFeatures(x_test)\n",
    "\n",
    "    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)\n",
    "    inf.set_scale(exp(kernel_log_scale))\n",
    "    try:\n",
    "    #used to make sure the kernel matrix is positive definite\n",
    "        inf.set_noise_factor(1e-6)\n",
    "        inf.set_min_coeff_kernel(1e-5)\n",
    "        inf.set_max_attempt(10)\n",
    "    except:\n",
    "        pass\n",
    "    if minimizer !=None:\n",
    "        opt=minimizer()\n",
    "        opt.set_lbfgs_parameters(100,2000,linesearch,2000)\n",
    "        inf.register_minimizer(opt);\n",
    "\n",
    "    gp = GaussianProcessClassification(inf)\n",
    "    gp.train()\n",
    "    prob=gp.get_probabilities(features_test)\n",
    "\n",
    "    return prob, inf.get_name()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "labels={\n",
    "    \"m\":1,\n",
    "    \"r\":-1,\n",
    "       }\n",
    "\n",
    "def extract(inf, train_size):\n",
    "    \"\"\"\n",
    "    This function is used to processing raw data\n",
    "        \n",
    "    Parameters:\n",
    "        inf - the path of the raw data\n",
    "        train_size - number of data points used for testing   \n",
    "        \n",
    "    Returns:\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train, y_test - Y for training and testing\n",
    "        B - for computing the information bit\n",
    "    \"\"\"\n",
    "\n",
    "    random.seed(1)\n",
    "    x=[]\n",
    "    y=[]\n",
    "    for line in open(inf):\n",
    "        line=line.strip()\n",
    "        info=line.split(',')\n",
    "        label=labels[info[-1].lower()]\n",
    "        x.append(list(map(float, info[:-1])))\n",
    "        y.append(label)\n",
    "        \n",
    "    #train_size should be less than the size of all available data points\n",
    "    assert train_size < len(x) \n",
    "    idx=[i for i in range(len(y))]\n",
    "    random.shuffle(idx)\n",
    "    train_idx=set(idx[:train_size])\n",
    "    test_idx=set(idx[train_size:])\n",
    "    x_train = np.asarray([value for (idx, value) in enumerate(x) if idx in train_idx]).T\n",
    "    y_train = np.asarray([label for (idx, label) in enumerate(y) if idx in train_idx])\n",
    "    x_test = np.asarray([value for (idx, value) in enumerate(x) if idx in test_idx]).T\n",
    "    y_test = np.asarray([label for (idx, label) in enumerate(y) if idx in test_idx])\n",
    "\n",
    "    y_train_positive=(y_train==1).sum()\n",
    "    y_train_negative=(y_train==-1).sum()\n",
    "    y_test_positive=(y_test==1).sum()\n",
    "    y_test_negative=(y_test==-1).sum()\n",
    "\n",
    "    prb_positive=float(y_train_positive)/len(y_train)\n",
    "    B=float(y_test_positive)*log(prb_positive,2)+float(y_test_negative)*log(1.0-prb_positive,2)\n",
    "    B=-B/len(y_test)\n",
    "\n",
    "    return x_train, y_train, x_test, y_test, B"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def approx_bit_plot(methods, minimizers, linesearches, likelihoods, x_train, y_train, x_test , y_test, plots, lScale, lSigma, B):\n",
    "    \"\"\"\n",
    "    This function is used to plot information bit figures\n",
    "        \n",
    "    Parameters:\n",
    "        methods - a list of inference methods used to train the classifer\n",
    "        minimizers - a list of minimizers used in the inference method\n",
    "        linesearches - a list of line search methods used in minimizer\n",
    "        likelihood - a list of likelihood functions to model data\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train, y_test - Y for training and testing\n",
    "        plots  - plot objects\n",
    "        lScale, lSigma  - a list of hyper-parameters of covariance function    \n",
    "        B - for computing information bits\n",
    "        \n",
    "    Returns:\n",
    "        predictive result of the testing data set\n",
    "    \"\"\"\n",
    "    #Note that information bit is used to measure effectiveness of an inference method\n",
    "    if len(plots.shape)==1:\n",
    "        rows=1\n",
    "        cols=plots.shape[0]\n",
    "    else:\n",
    "        (rows, cols) = plots.shape\n",
    "\n",
    "    methods=np.asarray(methods).reshape(rows, cols)\n",
    "    minimizers=np.asarray(minimizers).reshape(rows, cols)\n",
    "    linesearches=np.asarray(linesearches).reshape(rows, cols)\n",
    "    likelihoods=np.asarray(likelihoods).reshape(rows, cols) \n",
    "\n",
    "    for r in range(rows):\n",
    "        for c in range(cols):\n",
    "            inference = methods[r][c]\n",
    "            minimizer = minimizers[r][c]\n",
    "            likelihood = likelihoods[r][c]\n",
    "            linesearch = linesearches[r][c]\n",
    "            try:\n",
    "                likelihood.set_noise_factor(1e-15)\n",
    "                likelihood.set_strict_scale(0.01)\n",
    "            except:\n",
    "                pass\n",
    "            scores=[]\n",
    "            for items in zip(lScale, lSigma):\n",
    "                for parameters in zip(items[0],items[1]):\n",
    "                    lscale=parameters[0]\n",
    "                    lsigma=parameters[1]\n",
    "                    (prob, inf_name)=learning_example1(inference, minimizer, linesearch, likelihood, x_train, y_train, x_test,  y_test, \n",
    "                                              lscale, lsigma)\n",
    "                    #compute the information bit\n",
    "                    score=0.0\n",
    "                    for (label_idx, prb_idx) in zip(y_test, prob):\n",
    "                        if label_idx==1:\n",
    "                            score+=(1.0+label_idx)*log(prb_idx,2)\n",
    "                        else: #label_idx==-1:\n",
    "                            score+=(1.0-label_idx)*log(1.0-prb_idx,2)\n",
    "                    score=score/(2.0*len(y_test))+B\n",
    "                    scores.append(score)\n",
    "            scores=np.asarray(scores).reshape(len(lScale),len(scores)/len(lScale))\n",
    "\n",
    "            if len(plots.shape)==1:\n",
    "                sub_plot=plots\n",
    "            else:\n",
    "                sub_plot=plots[r]\n",
    "            CS =sub_plot[c].contour(lScale, lSigma, scores)\n",
    "            sub_plot[c].clabel(CS, inline=1, fontsize=10)\n",
    "            sub_plot[c].set_title('information bit for %s'%inf_name)\n",
    "            sub_plot[c].set_xlabel('log_scale')\n",
    "            sub_plot[c].set_ylabel('log_sigma')\n",
    "            sub_plot[c].axis('equal')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#an example for the sonar data set\n",
    "train_size=108\n",
    "(x_train, y_train, x_test, y_test, B)=extract(os.path.join(SHOGUN_DATA_DIR, 'uci/sonar/sonar.all-data'), train_size)\n",
    "inference_methods =[\n",
    "                  KLDualInferenceMethod,\n",
    "                  SingleLaplaceInferenceMethod,\n",
    "                  SingleLaplaceInferenceMethod,\n",
    "                  KLDiagonalInferenceMethod,\n",
    "                  #KLCholeskyInferenceMethod, #this method takes too long to run\n",
    "                  #KLCovarianceInferenceMethod, #this method takes too long to run\n",
    "                  ]\n",
    "\n",
    "likelihoods =[\n",
    "            LogitDVGLikelihood(), #KLDual method uses a likelihood class that supports dual variational inference\n",
    "            LogitLikelihood(),        \n",
    "            LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference\n",
    "            LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference\n",
    "            #LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference\n",
    "            #LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference\n",
    "            ]\n",
    "\n",
    "minimizers =[\n",
    "            KLDualInferenceMethodMinimizer,\n",
    "            None, #using default minimizer\n",
    "            LBFGSMinimizer,\n",
    "            LBFGSMinimizer,\n",
    "            #LBFGSMinimizer,\n",
    "            #LBFGSMinimizer,\n",
    "            ]\n",
    "\n",
    "linesearches=[\n",
    "            BACKTRACKING_ARMIJO,\n",
    "            None, #using default line search method\n",
    "            BACKTRACKING_STRONG_WOLFE,\n",
    "            BACKTRACKING_STRONG_WOLFE,\n",
    "            #BACKTRACKING_STRONG_WOLFE,\n",
    "            #BACKTRACKING_STRONG_WOLFE,\n",
    "            ]\n",
    "\n",
    "col_size=8\n",
    "lscale_min=0.0\n",
    "lscale_max=5.0\n",
    "lsigma_min=0.0\n",
    "lsigma_max=5.0\n",
    "delta=0.5\n",
    "scale=5.0\n",
    "\n",
    "lscale_list = np.arange(lscale_min, lscale_max, delta)\n",
    "lsigma_list = np.arange(lsigma_min, lsigma_max, delta*scale)\n",
    "lScale, lSigma = np.meshgrid(lscale_list, lsigma_list)\n",
    "width=len(likelihoods)/2\n",
    "f, plots =plt.subplots(width, 2, figsize=(col_size*2,col_size*width))\n",
    "\n",
    "approx_bit_plot(inference_methods, minimizers, linesearches, likelihoods, x_train, y_train,  x_test, y_test, plots, lScale, lSigma, B)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the information bit is used to measure the accuracy of classification. \n",
    "If there is a perfect classification, the information bit should be 1. \n",
    "From these figures, we can observe that the Laplace method is worst among these methods in term of information bit. The EP method, the KLCholesky method and the KLCovariance method seem to be equally good. Surprisingly, for this data set, the KLApproxDiagonal method looks as good as the EP method in term of information bit. Note that in the KLApproxDiagonal method non-diagonal entries of the posterior covariance matrix are all zeros.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id=\"realeg2\">Another Real-world Example (USPS dataset)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the US Postal Service (USPS) database of handwritten digits as another example. The dataset refers to numeric data obtained from the scanning of handwritten digits from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images.\n",
    "\n",
    "We consider to classify the images of digit 3 from images of digit 5 as demonstration, which means only images of digit 3 and digit 5 are used. Note that this is example is also used in section 3.7.3 of the textbook, <a href=\"http://www.gaussianprocess.org/gpml/\">Gaussian Processes for Machine Learning</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def binary_extract(idx, features, labels):\n",
    "    \"\"\"\n",
    "    This function is used to extract a trainning set and a test set\n",
    "    \n",
    "    Parameters:\n",
    "        idx - a 2-wide list of digits (eg, [0,3] means to extract a sub set of images and labels for digit 3 and digit 0) \n",
    "        features - the whole image set \n",
    "        labels - the whole label set\n",
    "        \n",
    "    Returns:\n",
    "        binary_features - X (images for binary classification)\n",
    "        binary_labels - y (labels for binary classification)\n",
    "    \"\"\"\n",
    "    #binary classification\n",
    "    assert len(idx) == 2\n",
    "    positive_idx = (labels[idx[0],:] == 1)\n",
    "    negative_idx = (labels[idx[1],:] == 1)\n",
    "    binary_idx = (positive_idx | negative_idx)\n",
    "    ds = binary_idx.shape[-1]\n",
    "\n",
    "    bin_labels = np.zeros(ds) \n",
    "    bin_labels[positive_idx] = 1\n",
    "    bin_labels[negative_idx] = -1\n",
    "\n",
    "    binary_features = (features[:,binary_idx])\n",
    "    binary_labels = (bin_labels[binary_idx])\n",
    " \n",
    "    positive_count = bin_labels[positive_idx].shape[-1]\n",
    "    negative_count = bin_labels[negative_idx].shape[-1]\n",
    "    binary_count = binary_labels.shape[-1]\n",
    " \n",
    "    print(\"There are %d positive samples and %d negative samples\" %(positive_count, negative_count))\n",
    "    return binary_features, binary_labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def learning_example2(inference, likelihood, x_train, x_test, y_train, y_test, using_lbfgs=False):\n",
    "    \"\"\"\n",
    "    This function is used to train a GP classifer given an inference method\n",
    "        \n",
    "    Parameters:\n",
    "        inference - an inference methods used to train the classifer\n",
    "        likelihood - likelihood function to model data\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train, y_test - Y for training and testing \n",
    "        \n",
    "    Returns:\n",
    "        Nothing\n",
    "    \"\"\"\n",
    "        \n",
    "    #train a GP classifer\n",
    "    error_eval = ErrorRateMeasure()\n",
    "    mean_func = ConstMean()\n",
    "    #set hyper-parameters of covariance function\n",
    "    kernel_log_sigma = 1.0\n",
    "    kernel_sigma = 2*exp(2*kernel_log_sigma);\n",
    "    kernel_func = GaussianKernel(10, kernel_sigma)\n",
    "\n",
    "    #sample by 1\n",
    "    labels_train = BinaryLabels(y_train)\n",
    "    labels_test = BinaryLabels(y_test)\n",
    "    #feature by sample\n",
    "    features_train=RealFeatures(x_train)\n",
    "    features_test=RealFeatures(x_test)\n",
    "\n",
    "    kernel_log_scale = 1.0\n",
    "\n",
    "    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)\n",
    "    print(\"\\nusing %s\"%inf.get_name())\n",
    "    \n",
    "    inf.set_scale(exp(kernel_log_scale))\n",
    "    \n",
    "    if using_lbfgs:\n",
    "        minimizer1 = LBFGSMinimizer()\n",
    "        minimizer1.set_lbfgs_parameters(100,80,BACKTRACKING_STRONG_WOLFE,80)\n",
    "        inf.register_minimizer(minimizer1);\n",
    "    \n",
    "\n",
    "    start = time.time()\n",
    "    gp = GaussianProcessClassification(inf)\n",
    "    gp.train()\n",
    "    end = time.time()\n",
    "    print(\"cost %.2f seconds at training\"%(end-start))\n",
    "    nlz=inf.get_negative_log_marginal_likelihood()\n",
    "    print(\"the negative_log_marginal_likelihood is %.4f\"%nlz)\n",
    "    start = time.time()\n",
    "    #classification on train_data\n",
    "    pred_labels_train = gp.apply_binary(features_train)\n",
    "    #classification on test_data\n",
    "    pred_labels_test = gp.apply_binary(features_test)\n",
    "    end = time.time()    \n",
    "    print(\"cost %.2f seconds at prediction\"%(end-start))\n",
    "    \n",
    "    error_train = error_eval.evaluate(pred_labels_train, labels_train)\n",
    "    error_test = error_eval.evaluate(pred_labels_test, labels_test)\n",
    "    \n",
    "    print(\"Train error : %.2f Test error: %.2f\\n\" % (error_train, error_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#an example for the USPS data set\n",
    "data=scipy.io.loadmat(os.path.join(SHOGUN_DATA_DIR, 'toy/usps/usps_resampled/usps_resampled.mat'))\n",
    "train_labels=data['train_labels']\n",
    "test_labels=data['test_labels']\n",
    "train_features=data['train_patterns']\n",
    "test_features=data['test_patterns']\n",
    "#using images of digit 3 and digit 5 from the dataset\n",
    "idx=[3,5]\n",
    "#Note that \n",
    "#y_train and y_test are followed the definition in the first section\n",
    "#the transpose of x_train and x_test are followed the definition in the first section\n",
    "print(\"Training set statistics\")\n",
    "(x_train, y_train)=binary_extract(idx,train_features, train_labels)\n",
    "print(\"Test set statistics\")\n",
    "(x_test, y_test)=binary_extract(idx,test_features, test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id='laplace'>Using Laplace Method</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"http://en.wikipedia.org/wiki/Laplace%27s_method\">The Laplace method</a> can be viewed as a special case of variational method.\n",
    "The idea of the Laplace method is to use the second-order <a href=\"http://en.wikipedia.org/wiki/Taylor_series\">Taylor  expansion</a> in the <a href=\"http://en.wikipedia.org/wiki/Mode_%28statistics%29\">mode</a>, $\\mathbf{\\hat{f}}$ of $p(\\mathbf{f}|\\mathbf{y})$ to approximate $p(\\mathbf{f}|\\mathbf{y})$, which is given below:\n",
    "\n",
    "$p(\\mathbf{f}|\\mathbf{y}) \\approx \\hat{p}(\\mathbf{f}|\\mathbf{y})$ is $N(\\mathbf{\\hat{f}}, H_{\\mathbf{\\hat{f}}}^{-1})$ ,where $H_{\\mathbf{\\hat{f}}}$ is the <a href=\"http://en.wikipedia.org/wiki/Hessian_matrix\">Hessian matrix</a> in $\\mathbf{\\hat{f}}$\n",
    "\n",
    "Therefore, the KL divergence, ${\\mathrm{KL}}(Q\\|P)$, is approximated by $\\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{\\hat{p}(\\mathbf{f}|y)}\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f}$.\n",
    "\n",
    "By **minimizing** $\\int_{-\\infty}^\\infty \\ln\\left(\\frac{q(\\mathbf{f}|\\mathbf{y})}{\\hat{p}(\\mathbf{f}|y)}\\right) q(\\mathbf{f}|\\mathbf{y}) \\ {\\rm d}\\mathbf{f}$\n",
    "we can get $q(\\mathbf{f}|\\mathbf{y})= \\hat{p}(\\mathbf{f}|\\mathbf{y})$.\n",
    "In practice, we can use <a href=\"http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization\">Newton-Raphson</a> optimizer or <a href=\"http://en.wikipedia.org/wiki/Quasi-Newton_method\">Quasi-Newton</a> optimizers(ie, <a href=\"http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm\">Limited-memory Broyden–Fletcher–Goldfarb–Shanno</a> (LBFGS)) to find $\\mathbf{\\hat{f}}$.\n",
    "Since the likelihood is the inverse-logit, we can show that the objective function, $\\ln(p(\\mathbf{f}|\\mathbf{y}))$, is strictly <a href=\"http://en.wikipedia.org/wiki/Concave\">concave</a> with respect to $\\mathbf{f}$, which means in theory these optimizers can find the same global solution $\\mathbf{\\hat{f}}$.\n",
    "We demonstrate how to apply the Laplace method via Newton-Raphson optimizer and LBFGS optimizer in Shogun to the USPS data set as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "inference=SingleLaplaceInferenceMethod\n",
    "likelihood = LogitLikelihood()\n",
    "learning_example2(inference, likelihood, x_train, x_test, y_train, y_test, False) #using Newton method\n",
    "\n",
    "learning_example2(inference, likelihood, x_train, x_test, y_train, y_test, True) #using lbfgs method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark:  In practice, the Laplace method is the fastest method among all implemented variational methods in Shogun. However, the second-order Taylor approximation in the mode of marginal likelihood is not always a good approximation, which can be observed in the figures in the previous section for the sonar data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id='covv'>Using Covariance Variational Method</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method is mentioned in <a href=\"http://jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf\">the paper</a> of Nickisch and Rasmussen in 2008, which is called the KL method in the paper.\n",
    "The idea of this method is to maximize $E_q[ln(p(\\mathbf{f}))] + E_q[ln(p(\\mathbf{y}|\\mathbf{f}))] - E_q[ln(q(\\mathbf{f}|\\mathbf{y}))]$ with respect to $\\mu$ and $\\Sigma$, by reparametrizing $\\Sigma$ to reduce the dimension of variables to be optimized.\n",
    "However, such parametrization does not preserve convexity (concave in this setting) according to <a href=\"http://arxiv.org/abs/1306.1052\">the paper</a> of Khan et. al. in 2013. We demonstrate how to apply this variational method in Shogun to the USPS example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitVGLikelihood()\n",
    "learning_example2(KLCovarianceInferenceMethod, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark: This method may be the first variational method used in GPC models. However, in practice, it is the slowest method among all implemented variational methods in Shogun."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id='meanfield'>Using Mean-field Variational Method</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method is also called the factorial variational method in <a href=\"http://jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf\">the paper</a> of Nickisch and Rasmussen in 2008.\n",
    "The idea of mean-field variatonal method is to enforce artificial structure on the co-variance matrix, $\\Sigma$ of $q(\\mathbf{f})$.\n",
    "In mean-field variatonal method, $\\Sigma$ is a diagonal positive-definite matrix.\n",
    "We demonstrate how to apply this variational method in Shogun to the USPS example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitVGLikelihood()\n",
    "learning_example2(KLDiagonalInferenceMethod, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark: This method could be as fast as the Laplace method in Shogun in practice but ignores all off-diagonal elements of the covariance matrix. However, in some case (ie, there are a lot of noise in data set), such structure regularization may bring some promising result compared to other variational methods, which can be observed in the previous section for the sonar data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"chol\">Using Cholesky Variational Method</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method is proposed in <a href=\"http://jmlr.org/proceedings/papers/v15/challis11a/challis11a.pdf\">the paper</a> of Challis and Barber in 2011.\n",
    "The idea of this method is to maximize $E_q[\\ln(p(\\mathbf{f}))] + E_q[\\ln(p(\\mathbf{y}|\\mathbf{f}))] - E_q[\\ln(q(\\mathbf{f}|\\mathbf{y}))]$ in terms of the Cholesky representation, C, for the covariance matrix of $q(\\mathbf{f})$, where $\\Sigma=CC^T$ and C is a lower triangular matrix. This reparametrization helps ensure the positive-definiteness of the covariance matrix, which makes the optimization relatively easy. It also increases numerical stability.\n",
    "We demonstrate how to apply this variational method in Shogun to the USPS data set as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitVGLikelihood()\n",
    "learning_example2(KLCholeskyInferenceMethod, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark: This method may be faster to learn the complete structure of the covariance matrix from data than covariance variational method.\n",
    "The reason is solving linear system related to $\\Sigma$ is required at each optimization step.\n",
    "In Cholesky representation, the time complexity at each optimization step is reduced to O(n^2) from O(n^3) compared to covariance variational method.\n",
    "However, the overall time complexity is still O(n^3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id=\"dual\">Using Dual Variational Method</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method is proposed in <a href=\"http://arxiv.org/abs/1306.1052\">the paper</a> of Khan et. al. in 2013.\n",
    "The idea of this method is to optimize $E_q[ln(p(\\mathbf{f}))] + E_q[ln(p(\\mathbf{y}|\\mathbf{f}))] - E_q[ln(q(\\mathbf{f}|\\mathbf{y}))]$ in <a href=\"http://en.wikipedia.org/wiki/Duality_%28optimization%29\">Lagrange dual</a> form instead of the primal form used in covariance variational method via explicitly expressing constraint in the covariance matrix, $\\Sigma$, in terms of auxiliary variable.\n",
    "However, the time complexity of each optimization step is still O(n^3) and variational bound usually is required to approximate $E_{q_i}[ln(p(\\mathbf{y_i}|\\mathbf{f_i})]$ for GPC in this method.\n",
    "We demonstrate how to apply this variational method in Shogun to the USPS data set as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitDVGLikelihood()\n",
    "likelihood.set_strict_scale(0.1)\n",
    "learning_example2(KLDualInferenceMethod, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark: This method requires O(N) variables to be optimized while the Cholesky method requires O(N^2) variables to be optimized. This fact is important for large-scale cases. A promising observation in practice is that this method is faster than covariance variational method and may not be slower than the Cholesky method. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a id=\"bounds\">Likelihood Classes Supported Variational Bounds</a> <a href='#toc'>[TOP]</a> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a id=\"bgvb\">Some Background of Variational Bounds in GPC Models</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we discuss how to approximate $E_{q_i}[ln(p(\\mathbf{y_i}|\\mathbf{f_i})]$. Please see section <a href=\"#nonclosedForm\">Dealing with the Non-closed Form Issue in GPC Models</a> if you are not familar with variational bounds.\n",
    "Since this term is about one-dimensional integration, for GPC we can use <a href=\"http://en.wikipedia.org/wiki/Gauss%E2%80%93Hermite_quadrature\">Gauss–Hermite quadrature</a> to appromxiate this term. In Shogun, the corresponding implementation of Bernoulli-logistic likelihood for variational inference is called <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitVGLikelihood.html\">LogitVGLikelihood</a> . This likelihood can be used for all variational methods (primal  form) except the dual variational method (dual form).\n",
    "\n",
    "The piecewise variational bound proposed at <a href=\"http://www.icml-2011.org/papers/376_icmlpaper.pdf\">the paper</a> of Benjamin M. Marlin et. al. in 2011 is also implemented in Shogun, which is called <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitVGPiecewiseBoundLikelihood.html\">LogitVGPiecewiseBoundLikelihood</a>. This likelihood can be used for all variational methods except the dual variational method.\n",
    "\n",
    "For dual variational method, we use the variational bound in <a href=\"http://www.cs.cmu.edu/~lafferty/pub/ctm.pdf\">the paper</a> of DM Blei et. al. in 2007. The corresponding implementation in Shogun is called <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitDVGLikelihood.html\">LogitDVGLikelihood</a>. Note that when using LogitDVGLikelihood, the bound is only enabled for the dual variational method. When other variational methods use this class, it uses one-dimensional integration instead of the variational bound. For detailed discussion about variational bounds for GPC models, please refer to <a href=\"https://circle.ubc.ca/handle/2429/43640\">Khan's work</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitVGPiecewiseBoundLikelihood()\n",
    "likelihood.set_default_variational_bound()\n",
    "likelihood.set_noise_factor(1e-15)\n",
    "inference_methods=[\n",
    "                  KLCholeskyInferenceMethod,\n",
    "                  KLDiagonalInferenceMethod,\n",
    "                  #KLCovarianceInferenceMethod,\n",
    "                  ]\n",
    "for inference in inference_methods:\n",
    "    #using the default linesearch method\n",
    "    learning_example2(inference, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a id='parameters'>Optimizing Parameters of GPC (Model Selection)</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We demonstrate how to optimize hypyer-parameters in Shogun by taining a GP classifier for the USPS example using the mean-field variational method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def learning_example3(inference, likelihood, x_train, x_test, y_train, y_test):\n",
    "    \"\"\"\n",
    "    This function is used to train a GP classifer given an inference method and hyper-parameters are automatically optimized\n",
    "        \n",
    "    Parameters:\n",
    "        inference - an inference methods used to train the classifer\n",
    "        likelihood - likelihood function to model data\n",
    "        x_train, x_test - X for training and testing\n",
    "        y_train, y_test - Y for training and testing\n",
    "        \n",
    "    Returns:\n",
    "        Nothing\n",
    "    \"\"\"\n",
    "    error_eval = ErrorRateMeasure()\n",
    "    mean_func = ZeroMean()\n",
    "    kernel_log_sigma = 1.0\n",
    "    kernel_sigma = 2*exp(2*kernel_log_sigma);\n",
    "    kernel_func = GaussianKernel(10, kernel_sigma)\n",
    "\n",
    "    #sample by 1\n",
    "    labels_train = BinaryLabels(y_train)\n",
    "    labels_test = BinaryLabels(y_test)\n",
    "    #feature by sample\n",
    "    features_train=RealFeatures(x_train)\n",
    "    features_test=RealFeatures(x_test)\n",
    "\n",
    "    kernel_log_scale = 1.0\n",
    "\n",
    "    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)\n",
    "    print(\"\\nusing %s\"%inf.get_name())\n",
    "    \n",
    "    inf.set_scale(exp(kernel_log_scale))\n",
    "    \n",
    "    minimizer1 = LBFGSMinimizer()\n",
    "    minimizer1.set_lbfgs_parameters(100,80,BACKTRACKING_STRONG_WOLFE,80)\n",
    "    inf.register_minimizer(minimizer1);\n",
    "\n",
    "    gp = GaussianProcessClassification(inf)\n",
    "\n",
    "    # evaluate our inference method for its derivatives\n",
    "    grad = GradientEvaluation(gp, features_train, labels_train, GradientCriterion(), False)\n",
    "    grad.set_function(inf)\n",
    "\n",
    "    # handles all of the above structures in memory\n",
    "    grad_search = GradientModelSelection(grad)\n",
    "\n",
    "    # search for best parameters and store them\n",
    "    best_combination = grad_search.select_model()\n",
    "\n",
    "    # apply best parameters to GP\n",
    "    best_combination.apply_to_machine(gp)\n",
    "\n",
    "    # we have to \"cast\" objects to the specific kernel interface we used (soon to be easier)\n",
    "    best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()\n",
    "    best_scale=inf.get_scale()\n",
    "    print(\"Selected kernel bandwidth:\", best_width)\n",
    "    print(\"Selected kernel scale:\", best_scale)\n",
    "\n",
    "    start = time.time()\n",
    "    gp.train()\n",
    "    end = time.time()\n",
    "    print(\"cost %s seconds at training\"%(end-start))\n",
    "    nlz=inf.get_negative_log_marginal_likelihood()\n",
    "    print(\"the negative_log_marginal_likelihood is %.4f\"%nlz)\n",
    "    start = time.time()\n",
    "    #classification on train_data\n",
    "    pred_labels_train = gp.apply_binary(features_train)\n",
    "    #classification on test_data\n",
    "    pred_labels_test = gp.apply_binary(features_test)\n",
    "    end = time.time()    \n",
    "    print(\"cost %s seconds at prediction\"%(end-start))\n",
    "    \n",
    "    error_train = error_eval.evaluate(pred_labels_train, labels_train)\n",
    "    error_test = error_eval.evaluate(pred_labels_test, labels_test)\n",
    "    \n",
    "    print(\"Train error : %.2f Test error: %.2f\\n\" % (error_train, error_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "likelihood = LogitVGLikelihood()\n",
    "likelihood.set_noise_factor(1e-15)\n",
    "learning_example3(KLDiagonalInferenceMethod, likelihood, x_train, x_test, y_train, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a id=\"ref\">Reference</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Rasmussen, Carl Edward. \"Gaussian processes for machine learning.\" (2006).\n",
    "* Bishop, Christopher M. Pattern recognition and machine learning. Vol. 1. New York: springer, 2006.\n",
    "* Nickisch, Hannes, and Carl Edward Rasmussen. \"Approximations for binary Gaussian process classification.\" Journal of Machine Learning Research 9 (2008): 2035-2078.\n",
    "* Challis, Edward, and David Barber. \"Concave Gaussian variational approximations for inference in large-scale Bayesian linear models.\" International conference on Artificial Intelligence and Statistics. 2011.\n",
    "* Emtiyaz Khan, Mohammad, et al. \"Fast dual variational inference for non-conjugate latent gaussian models.\" Proceedings of The 30th International Conference on Machine Learning. 2013.\n",
    "* Marlin, Benjamin M., Mohammad Emtiyaz Khan, and Kevin P. Murphy. \"Piecewise Bounds for Estimating Bernoulli-Logistic Latent Gaussian Models.\" ICML. 2011.\n",
    "* Khan, Mohammad. \"Variational learning for latent Gaussian model of discrete data.\" (2012).\n",
    "* Wang, Chong, and David M. Blei. \"Variational inference in nonconjugate models.\" The Journal of Machine Learning Research 14.1 (2013): 1005-1031.\n",
    "* Paisley, John, David Blei, and Michael Jordan. \"Variational Bayesian inference with stochastic search.\" ICML. 2012."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a id='todo'>Soon to Come</a> <a href='#toc'>[TOP]</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Large-scale inference for structured GPC models\n",
    "* Stochastic variational inference for sparse GPC models\n",
    "* GP latent variable models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
